A Discriminative Representation of Convolutional Features for Indoor Scene Recognition
Indoor scene recognition is a multi-faceted and challenging problem due to
the diverse intra-class variations and the confusing inter-class similarities.
This paper presents a novel approach which exploits rich mid-level
convolutional features to categorize indoor scenes. Traditionally used
convolutional features preserve the global spatial structure, which is a
desirable property for general object recognition. However, we argue that this
structure is of limited help when scene layouts vary widely, as they do in
indoor scenes. We propose to transform the structured
convolutional activations to another highly discriminative feature space. The
representation in the transformed space not only incorporates the
discriminative aspects of the target dataset, but it also encodes the features
in terms of the general object categories that are present in indoor scenes. To
this end, we introduce a new large-scale dataset of 1300 object categories
which are commonly present in indoor scenes. Our proposed approach achieves a
significant performance boost over previous state-of-the-art approaches on five
major scene classification datasets.
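As a rough illustration of the core idea, the sketch below discards the spatial layout of a mid-level convolutional feature map by scoring every location against a bank of object-category prototypes and max-pooling the scores into an orderless descriptor. All names, the random prototype bank, and the pooling choice are illustrative assumptions, not the paper's exact pipeline.

    # Minimal sketch: discard the spatial structure of conv features by encoding
    # each location against a bank of object-category prototypes, then pooling.
    # The backbone activations, prototype bank, and pooling are assumptions.
    import torch
    import torch.nn.functional as F

    def discriminative_encoding(feat_map, prototypes):
        """feat_map: (B, C, H, W) mid-level conv activations.
        prototypes: (K, C), one vector per object category (random here)."""
        B, C, H, W = feat_map.shape
        locs = feat_map.permute(0, 2, 3, 1).reshape(B, H * W, C)  # local descriptors
        locs = F.normalize(locs, dim=-1)
        protos = F.normalize(prototypes, dim=-1)
        sims = locs @ protos.t()                  # (B, HW, K) cosine similarities
        return sims.max(dim=1).values             # orderless max-pool -> (B, K)

    feat = torch.randn(2, 512, 14, 14)            # e.g. conv5-level activations
    bank = torch.randn(1300, 512)                 # 1300 object categories, as in the paper
    code = discriminative_encoding(feat, bank)
    print(code.shape)                             # torch.Size([2, 1300])

The resulting descriptor is expressed in terms of object-category evidence rather than spatial position, which matches the paper's motivation for layout-variant indoor scenes.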
Open-Vocabulary Object Detection via Scene Graph Discovery
In recent years, open-vocabulary (OV) object detection has attracted
increasing research attention. Unlike traditional detection, which only
recognizes fixed-category objects, OV detection aims to detect objects in an
open category set. Previous works often leverage vision-language (VL) training
data (e.g., referring grounding data) to recognize OV objects. However, they
only use pairs of nouns and individual objects in VL data, while these data
usually contain much more information, such as scene graphs, which are also
crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based
Discovery Network (SGDN) that exploits scene graph cues for OV detection.
Firstly, a scene-graph-based decoder (SGDecoder) including sparse
scene-graph-guided attention (SSGA) is presented. It captures scene graphs and
leverages them to discover OV objects. Secondly, we propose scene-graph-based
prediction (SGPred), where we build a scene-graph-based offset regression
(SGOR) mechanism to enable mutual enhancement between scene graph extraction
and object localization. Thirdly, we design a cross-modal learning mechanism in
SGPred. It takes scene graphs as bridges to improve the consistency between
cross-modal embeddings for OV object classification. Experiments on COCO and
LVIS demonstrate the effectiveness of our approach. Moreover, we show the
ability of our model for OV scene graph detection, while previous OV scene
graph generation methods cannot tackle this task.
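A minimal sketch of the underlying attention idea, assuming a single head and a hand-built boolean adjacency: object queries attend only to their scene-graph neighbours. This is only an illustration of graph-masked attention, not SGDN's actual SSGA module.

    # Minimal sketch of sparse, graph-masked attention over object queries.
    import torch
    import torch.nn.functional as F

    def graph_masked_attention(q, k, v, adj):
        """q, k, v: (N, D) object query/key/value features.
        adj: (N, N) boolean adjacency from a (hypothetical) scene graph."""
        d = q.shape[-1]
        scores = q @ k.t() / d ** 0.5
        scores = scores.masked_fill(~adj, float("-inf"))  # keep only graph edges
        return F.softmax(scores, dim=-1) @ v

    N, D = 5, 64
    q = k = v = torch.randn(N, D)
    adj = torch.eye(N, dtype=torch.bool)          # self-loops keep every row finite
    adj[0, 1] = adj[1, 0] = True                  # e.g. a "person - holds - cup" edge
    out = graph_masked_attention(q, k, v, adj)
    print(out.shape)                              # torch.Size([5, 64])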
Towards Robust and Reproducible Active Learning Using Neural Networks
Active learning (AL) is a promising ML paradigm that has the potential to
parse through large unlabeled datasets and help reduce annotation cost in
domains where labeling the entire dataset can be prohibitive. Recently proposed
neural-network-based AL methods use different heuristics to accomplish this goal. In this
study, we show that recent AL methods offer a gain over the random baseline only under a
brittle combination of experimental conditions. We demonstrate that such
marginal gains vanish when experimental factors are changed, leading to
reproducibility issues and suggesting that AL methods lack robustness. We also
observe that with a properly tuned model, which employs recently proposed
regularization techniques, performance improves significantly for all AL
methods including the random sampling baseline, and performance differences
among the AL methods become negligible. Based on these observations, we suggest
a set of experiments that are critical to assess the true effectiveness of an
AL method. To facilitate these experiments we also present an open source
toolkit. We believe our findings and recommendations will help advance
reproducible research in robust AL using neural networks.
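The kind of comparison the paper stresses can be sketched as a short pool-based AL loop. The dataset, model, budget, and entropy heuristic below are illustrative choices, and the random baseline is obtained by swapping a single line; evaluation on the full set is a toy shortcut.

    # Minimal pool-based AL loop: entropy acquisition vs. a random baseline.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X, y = load_digits(return_X_y=True)
    labeled = list(rng.choice(len(X), size=20, replace=False))
    pool = [i for i in range(len(X)) if i not in labeled]

    for round_ in range(5):
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        print(f"round {round_}: acc={clf.score(X, y):.3f} with {len(labeled)} labels")
        probs = clf.predict_proba(X[pool])
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        # For the random baseline, replace the next line with:
        # picked = list(rng.choice(pool, size=20, replace=False))
        picked = [pool[j] for j in np.argsort(-entropy)[:20]]
        labeled += picked
        pool = [i for i in pool if i not in picked]

Running both acquisition strategies under identical seeds, budgets, and regularization is exactly the sort of controlled comparison the paper argues is needed.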
Unified Open-Vocabulary Dense Visual Prediction
In recent years, open-vocabulary (OV) dense visual prediction (such as OV
object detection and semantic, instance, and panoptic segmentation) has attracted
increasing research attention. However, most existing approaches are
task-specific and individually tackle each task. In this paper, we propose a
Unified Open-Vocabulary Network (UOVN) to jointly address four common dense
prediction tasks. Compared with separate models, a unified network is more
desirable for diverse industrial applications. Moreover, training data for OV
dense prediction is relatively scarce. Separate networks can only leverage
task-relevant training data, while a unified approach can integrate diverse
training data to boost individual tasks. We address two major challenges in
unified OV prediction. Firstly, unlike unified methods for fixed-set
predictions, OV networks are usually trained with multi-modal data. Therefore,
we propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism
to better leverage multi-modal data. Secondly, because UOVN uses data from
different tasks for training, there are significant domain and task gaps. We
present a UOVN training mechanism to reduce such gaps. Experiments on four
datasets demonstrate the effectiveness of our UOVN.
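The common denominator of the four tasks is classifying region (or mask, or pixel) embeddings against text embeddings of an open label set. The sketch below shows only that shared step, with hypothetical embeddings and a CLIP-style temperature; it is not UOVN's MMM decoder.

    # Minimal sketch of open-vocabulary classification shared by dense tasks:
    # visual embeddings are scored against text embeddings of arbitrary names.
    import torch
    import torch.nn.functional as F

    def ov_classify(region_emb, text_emb, temperature=0.07):
        """region_emb: (N, D) per-region/mask/pixel features.
        text_emb: (K, D) embeddings of K category names (open set)."""
        r = F.normalize(region_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return (r @ t.t()) / temperature          # (N, K) logits over an open label set

    regions = torch.randn(10, 256)
    labels = torch.randn(4, 256)                  # e.g. embeddings of ["cat", "dog", "sofa", "tree"]
    logits = ov_classify(regions, labels)
    print(logits.argmax(dim=-1))                  # per-region open-vocabulary predictions

Because the label set enters only through text embeddings, new categories can be added at inference time without retraining, which is what makes a single unified network attractive here.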
Adversarial Training of Variational Auto-encoders for High Fidelity Image Generation
Variational auto-encoders (VAEs) provide an attractive solution to the image
generation problem. However, they tend to produce blurred and over-smoothed
images due to their dependence on pixel-wise reconstruction loss. This paper
introduces a new approach to alleviate this problem in the VAE based generative
models. Our model simultaneously learns to match the data, reconstruction-loss,
and latent distributions of real and fake images to improve the quality of
generated samples. To compute the loss distributions, we introduce an
auto-encoder based discriminator model which allows an adversarial learning
procedure. The discriminator in our model also provides perceptual guidance to
the VAE by matching the learned similarity metric of the real and fake samples
in the latent space. To stabilize the overall training process, our model uses
an error feedback approach to maintain the equilibrium between competing
networks in the model. Our experiments show that the generated samples from our
proposed model exhibit a diverse set of attributes and facial expressions and
scale up to high-resolution images very well.
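A toy sketch of how such a composite objective can be wired up: pixel reconstruction, a KL term, and a feature-matching term computed in the discriminator's hidden space. The tiny linear networks, unit loss weights, and omission of the error-feedback mechanism are deliberate simplifications, not the paper's exact setup.

    # Minimal sketch: VAE losses plus discriminator-based feature matching.
    import torch
    import torch.nn.functional as F
    from torch import nn

    enc = nn.Linear(784, 2 * 32)                  # toy encoder -> (mu, logvar)
    dec = nn.Linear(32, 784)                      # toy decoder
    disc = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))  # toy auto-encoder discriminator

    x = torch.rand(8, 784)
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    x_hat = torch.sigmoid(dec(z))

    rec = F.mse_loss(x_hat, x)                                    # pixel-wise reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp()) # KL to the prior
    feat_match = F.mse_loss(disc[0](x_hat), disc[0](x).detach())  # perceptual guidance
    loss = rec + kl + feat_match
    loss.backward()
    print(float(loss))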
A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation
We propose a Transformer architecture for volumetric segmentation, a
challenging task that requires maintaining a delicate balance between encoding
local and global spatial cues and preserving information along all axes of the volume.
The encoder of the proposed design uses a self-attention mechanism to
simultaneously encode local and global cues, while the decoder employs a
parallel self- and cross-attention formulation to capture fine details for
boundary refinement. Empirically, we show that the proposed design choices
result in a computationally efficient model, with competitive and promising
results on the Medical Segmentation Decathlon (MSD) brain tumor segmentation
(BraTS) Task. We further show that the representations learned by our model are
robust against data corruptions.
Our code implementation is publicly available at
https://github.com/himashi92/VT-UNet.
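The parallel self- and cross-attention idea in the decoder can be sketched as follows. Dimensions, the equal-weight fusion, and the block layout are assumptions for illustration, not VT-UNet's exact formulation.

    # Minimal sketch of a decoder block running self- and cross-attention in
    # parallel over flattened volume tokens, then fusing the two streams.
    import torch
    from torch import nn

    class ParallelSCBlock(nn.Module):
        def __init__(self, dim=96, heads=4):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x, skip):
            s, _ = self.self_attn(x, x, x)         # decoder tokens attend to themselves
            c, _ = self.cross_attn(x, skip, skip)  # ...and to encoder skip tokens
            return self.norm(x + 0.5 * (s + c))    # equal-weight parallel fusion

    tokens = torch.randn(1, 8 * 8 * 8, 96)         # flattened 8x8x8 volume patches
    skip = torch.randn(1, 8 * 8 * 8, 96)
    print(ParallelSCBlock()(tokens, skip).shape)   # torch.Size([1, 512, 96])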
Striking the Right Balance with Uncertainty
Learning unbiased models on imbalanced datasets is a significant challenge.
Rare classes tend to get a concentrated representation in the classification
space, which hampers the generalization of learned boundaries to new test
examples. In this paper, we demonstrate that the Bayesian uncertainty estimates
directly correlate with the rarity of classes and the difficulty level of
individual samples. Subsequently, we present a novel framework for uncertainty
based class imbalance learning that follows two key insights: First,
classification boundaries should be extended further away from a more uncertain
(rare) class to avoid overfitting and enhance its generalization. Second, each
sample should be modeled as a multi-variate Gaussian distribution with a mean
vector and a covariance matrix defined by the sample's uncertainty. The learned
boundaries should respect not only the individual samples but also their
distribution in the feature space. Our proposed approach efficiently utilizes
sample and class uncertainty information to learn robust features and more
generalizable classifiers. We systematically study the class imbalance problem
and derive a novel loss formulation for max-margin learning based on a Bayesian
uncertainty measure. The proposed method shows significant performance
improvements on six benchmark datasets for face verification, attribute
prediction, digit/object classification, and skin lesion detection.
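One way to realize the first insight is an additive margin that grows with class uncertainty, so the decision boundary is pushed further from rare (more uncertain) classes. The sketch below uses made-up uncertainty values and a simple additive-margin cross-entropy; it is not the paper's exact Bayesian formulation.

    # Minimal sketch: margin on the true-class logit scaled by class uncertainty.
    import torch
    import torch.nn.functional as F

    def uncertainty_margin_loss(logits, targets, class_uncertainty, scale=1.0):
        """logits: (B, K). class_uncertainty: (K,), e.g. Bayesian estimates,
        higher for rarer classes (hypothetical values here)."""
        margins = scale * class_uncertainty[targets]              # per-sample margin
        adjusted = logits.clone()
        adjusted[torch.arange(len(targets)), targets] -= margins  # demand a larger margin
        return F.cross_entropy(adjusted, targets)

    logits = torch.randn(4, 3)
    targets = torch.tensor([0, 2, 1, 2])
    uncert = torch.tensor([0.1, 0.5, 0.9])        # class 2 is rarest/most uncertain
    print(float(uncertainty_margin_loss(logits, targets, uncert)))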